You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Posts with tag Data exploration and visualization


Oct 07 2011

As promised, some quick thoughts broken off my post on Dunning Log-likelihood. There, I looked at _big_ corpuses: two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English dept. digital humanists tend to rely on small sets of well-curated TEI texts, but even the ugly wilds of machine OCR might be able to offer them some insights. (Sidenote: interesting post by Ted Underwood today on the mechanics of creating a middle group between these two poles.)

Oct 06 2011

Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on Bookworm are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning's Log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.

Aug 04 2011

I mentioned earlier I've been rebuilding my database; I've also been talking to some of the people here at Harvard about various follow-up projects to ngrams. So this seems like a good moment to rethink a few pretty basic things about different ways of presenting historical language statistics. For extensive interactions, nothing is going to beat a database or direct access to text files in some form. But for quick interactions, which include a lot of pattern searching and public outreach, we have some interesting choices about presentation.

Apr 11 2011

Let's start with two self-evident facts about how print culture changes over time:

Feb 22 2011

Here's an animation of the PCA numbers I've been exploring this last week.

Feb 14 2011

One of the most important services a computer can provide for us is a different way of reading. It's fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.

Feb 02 2011

Genre information is important and interesting. Using the smaller of my two book databases, I can get some pretty good genre information about some fields I'm interested in for my dissertation by using the Library of Congress classifications for the books. I'm going to start with the difference between psychology and philosophy. I've already got some more interesting stuff than these basic charts, but I think a constrained comparison like this should be somewhat clearer.

Jan 18 2011

I'll end my unannounced hiatus by posting several charts that show the limits of the search-term clustering I talked about last week before I respond to a couple of things that caught my interest in the last week.

Jan 11 2011

Because of my primitive search engine, I've been thinking about some of the ways we can better use search data to a) interpret historical data, and b) improve our understanding of what goes on when we search. As I was saying then, there are two things that search engines let us do that we usually don't get:

Jan 10 2011

More access to the connections between words makes it possible to separate word-use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I'm thinking through ways to use patterns of correlations across books as a way to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I've been working with can help improve this sort of search. I'll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: What was the Civil War fought about?

Dec 27 2010

I finally got some call numbers. Not for everything, but for a better portion than I thought I would: about 7,600 records, or c. 30% of my books.

Dec 23 2010

Back to my own stuff. Before the Ngrams stuff came up, I was working on ways of finding books that share similar vocabularies. I said at the end of my second ngrams post that we have hundreds of thousands of dimensions for each book: let me explain what I mean. My regular readers were unconvinced, I think, by my first foray here into principal components, but I'm going to try again. This post is largely a test of whether I can explain principal components analysis to people who don't know about it, so: correct me if you already understand PCA, and let me know what's unclear if you don't. (Or, it goes without saying, skip it.)

Dec 13 2010

I'm interested in the ways different words are tied together. That's sort of the universal feature of this project, so figuring out ways to find them would be useful. I already looked at some ways of finding interesting words for "scientific method," but that was in the context of the related words as an endpoint of the analysis. I want to be able to automatically generate linked words, as well. I'm going to think through this staying on "capitalist" as the word of the day. Fair warning: this post is a rambler.

Dec 06 2010

Dan asks for some numbers on "capitalism" and "capitalist" similar to the ones on "Darwinism" and "Darwinist" I ran for Hank earlier. That seems like a nice big question I can use to get some basic methods to warm up the new database I set up this week and to get some basic functionality written into it.

Dec 04 2010

This verges on unreflective datadumping: but because it's easy and I think people might find it interesting, I'm going to drop in some of my own charts for total word use in 30,000 books by the largest American publishers on the same terms for which the Times published Cohen's charts of title word counts. I've tossed in a couple of extra words where it seems interesting, including some alternate word-forms that tell a story, using a Perl word-stemming algorithm I set up the other day that works fairly well. My charts run from 1830 (there just aren't many American books from before, and even the data from the '30s is a little screwy) to 1922 (the date that digital history ends; thank you, Sonny Bono). In some cases (that 1874 peak for "science"), the American and British trends are surprisingly close. Sometimes, they aren't.

Dec 03 2010

So I just looked at patterns of commemoration for a few famous anniversaries. This is, for some people, kind of interesting: how does the publishing industry focus in on certain figures to create news or resurgences of interest in them? I love the way we get excited about the Civil War sesquicentennial now, or the Darwin/Lincoln year last year.

Dec 03 2010

I was starting to write about the implicit model of historical change behind loess curves, which I'll probably post soon, when I started to think some more about a great counterexample to the gradual change I'm looking for: the patterns of commemoration for anniversaries. At anniversaries, as well as news events, I often see big spikes in wordcounts for an event or person.

Nov 26 2010

Nov 17 2010


Nov 12 2010

All right, let's put this machine into action. A lot of digital humanities is about visualization, which has its place in teaching, which Jamie asked for more about. Before I do that, though, I want to show some more about how this can be a research tool. Henry asked about the history of the term "scientific method." I assume he was asking for a chart showing its usage over time, but I already have, with the data in hand, a lot of other interesting displays that we can use. This post is a sort of catalog of what some of the low-hanging fruit in text analysis are.

Nov 11 2010

I now have counts for the number of books a word appears in, as well as the number of times it appeared. Just as I hoped, it gives a new perspective on a lot of the questions we looked at already. That telephone-telegraph-railroad chart, in particular, has a lot of interesting differences. But before I get into that, probably later today, I want to step back and think about what we can learn from the contrast between wordcounts and bookcounts. (I'm just going to call them bookcounts; I hope that's a clear enough phrase.)

Nov 10 2010

Here's what googling that question will tell you: about 400,000 words in the big dictionaries (OED, Webster's); but including technical vocabulary, a million, give or take a few hundred thousand. But for my poor computer, that's too many, for reasons too technical to go into here. Suffice it to say that I'm asking this question for mundane reasons, but the answer is kind of interesting anyway. No real historical questions in this post, though; I'll put the only big thought I have about it in another post later tonight.